EDISON: Feature Extraction for NLP, Simplified
نویسندگان
چکیده
When designing Natural Language Processing (NLP) applications that use Machine Learning (ML) techniques, feature extraction becomes a significant part of the development effort, whether developing a new application or attempting to reproduce results reported for existing NLP tasks. We present EDISON, a Java library of feature generation functions used in a suite of state-of-the-art NLP tools, based on a set of generic NLP data structures. These feature extractors populate simple data structures encoding the extracted features, which the package can also serialize to an intuitive JSON file format that can be easily mapped to formats used by ML packages. EDISON can also be used programmatically with JVM-based (Java/Scala) NLP software to provide the feature extractor input. The collection of feature extractors is organised hierarchically and a simple search interface is provided. In this paper we include examples that demonstrate the versatility and ease-of-use of the EDISON feature extraction suite to show that this can significantly reduce the time spent by developers on feature extraction design for NLP systems. The library is publicly hosted at https://github.com/IllinoisCogComp/illinois-cogcomp-nlp/, and we hope that other NLP researchers will contribute to the set of feature extractors. In this way, the community can help simplify reproduction of published results and the integration of ideas from diverse sources when developing new and improved NLP applications.
منابع مشابه
FudanNLP at RITE 2011: a Shallow Semantic Approach to Textual Entailment
RITE is a task recognizing logic relations between texts. This paper presents FDCS’s approach on NTCIR9-RITE Chinese simplified BC & MC subtasks. Our system is built on a machine learning architecture with features selected on shallow semantic methods, including named entity recognition, date & time expression extraction, words overlap and negation concept recognition. FudanNLP is wildly used b...
متن کاملA New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
متن کاملCommon Spatial Patterns Feature Extraction and Support Vector Machine Classification for Motor Imagery with the SecondBrain
Recently, a large set of electroencephalography (EEG) data is being generated by several high-quality labs worldwide and is free to be used by all researchers in the world. On the other hand, many neuroscience researchers need these data to study different neural disorders for better diagnosis and evaluating the treatment. However, some format adaptation and pre-processing are necessary before ...
متن کاملFextor: A Feature Extraction Framework for Natural Language Processing: A Case Study in Word Sense Disambiguation, Relation Recognition and Anaphora Resolution
Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Various NLP tasks have many common steps, e.g. low level act of reading a corpus and obtaining text windows from it. Some high-level processing steps might also be shared, e.g. testing for morphosyntactic constraints between words. An integrated featur...
متن کاملAn NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
Natural Language Processing continues to grow in popularity in a range of research and commercial applications, yet managing the wide array of potential NLP components remains a difficult problem. This paper describes CURATOR, an NLP management framework designed to address some common problems and inefficiencies associated with building NLP process pipelines; and EDISON, an NLP data structure ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016